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Abstract. - Investigations of molecular bonds between single molecules and molecular complexes 
by the dynamic force spectroscopy are subject to large fluctuations at nanoscale and possible other 
aspecific binding, which mask the experimental output. Big efforts are devoted to develop meth- 
ods for effective selection of the relevant experimental data, before taking the quantitative analysis 
of bond parameters. Here we present a methodology which is based on the application of graph 
theory. The force-distance curves corresponding to repeated pulling events are mapped onto their 
correlation network (mathematical graph). On these graphs the groups of similar curves appear 
as topological modules, which are identified using the spectral analysis of graphs. We demonstrate 
the approach by analyzing a large ensemble of the force-distance curves measured on: ssDNA- 
ssDNA, peptide-RNA (system from HIV1), and peptide-Au surface. Within our data sets the 
methodology systematically separates subgroups of curves which are related to different inter- 
molecular interactions and to spatial arrangements in which the molecules are brought together 
and/or pulling speeds. This demonstrates the sensitivity of the method to the spatial degrees of 
freedom, suggesting potential applications in the case of large molecular complexes and situations 
with multiple binding sites. 
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Introduction. — It has been recognized recently [1] 
that the signals generated at a single molecule (or another 
nano-size object) differ from signals obtained in large-scale 
systems consisting of ensemble of molecules. In particular, 
enhanced fluctuations, randomness and ^reproducibility 
of the signals are observed in single-molecule measure- 
ments. A representative example is the mechanical sig- 
nal generated in the dynamic force spectroscopy (DFS) of 
molecular bonds [2,3]. The force spectroscopy of single 
molecules and molecular complexes has become a lead- 
ing methodology for measuring biomolecular unbinding 
forces, which form the bases of biologically relevant molec- 
ular processes [4]. For instance, some recently studied 
examples include measurements of fundamental biomolec- 
ular forces in DNA unzipping [5], ALCAM-ALCAM [6], 
peptide- antibody [7], RNA- protein [8] interactions, etc. 

In a typical pulling experiment in DFS based on 
the Atomic Force Microscopy, the ligand and receptor 
molecule are attached via polymer linkers on the AFM 



tip and the solid support (c.g glass, mica, gold- surface). 
The molecules are brought close to each other for certain 
contact time allowing them to form a bond and then pulled 
away until the bond breaks. The process is repeated many 
times. In each pulling event, changes in the deflection of 
the AFM cantilever as a function of distance are measured. 
Knowing the spring constant of the cantilever, this data 
can be calculated into distance dependent forces, resulting 
in so called force-distance curves. From these curves dif- 
ferent parameters can be obtained, such as force needed to 
break a certain bond and the force loading rate. Further 
quantitative analysis of these data sets requires an elab- 
orated theoretical framework [2] in order to extract the 
parameters of binding potential, the bond strength and 
the survival time. The applied force reduces the barrier 
between the bound and dissociated states of the binding 
molecules, allowing to estimate the force at which the bar- 
rier disappears and to measure the dissociation rate X(F). 
Then the natural dissociation rate at vanishing force is ex- 
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tracted using an appropriate theoretical framework. Since 
the pioneering work [2] , assuming the exponential depen- 
dence of the dissociation rate on the applied force, the 
theory of the dynamic force spectroscopy evolved in two 
major directions: (i) incorporating a phcnomcnological 
distribution of the bond parameters [9], and (ii) assum- 
ing a specific type of the binding potential, from which 
the dissociation rate X(F) is computed exactly [10,11]. 

The pulling disruption process of the molecular bonds, 
for instance in constant-velocity experiments used in this 
work, is both nonlinear and stochastic. Nonlinearity, 
which is manifested in force-dependent loading rate F(F), 
can be related to the internal degrees of freedom of the 
(large) molecules and the spacers. Fluctuations in the dis- 
ruption force and loading rate originate from the diffusion 
of binding molecules in solvent, whose effects depend on 
the binding potential and pulling speed. Thus the bond 
parameters can be determined in terms of probability, us- 
ing the appropriate theoretical concept. In addition to 
the molecular bond of interest (specific binding), other 
non-specific processes may occur at the same contact time 
and range of forces: mis-binding or binding at different 
site, or unspecific interaction of the molecule with linkers 
and surface. It is, therefore, of great importance to be 
able to select the force-distance curves which are related 
to true binding prior to the quantitative analysis of the 
bond parameters. It is a general believe (see also discus- 
sion in [12]) that, apart from the fluctuations, the force- 
distance curves originating from the same binding process 
share strong similarity. Given a large amount of data in 
a particular experiment, this is a technically demanding 
task, which requires automated computational approach. 
Recently two methods were proposed based on pattern- 
recognition [13] and master-curve fitting [12]. 

In this work we present a new approach for system- 
atic selection of groups of mutually similar force-distance 
curves and demonstrate it on sets of data from peptide- 
RNA complex from HIV1. Our methodology is based on 
the theory of complex networks and their spectral anal- 
ysis [14-16]. Mapping multichannel datasets onto a net- 
work representation has proved as a useful tool [17] in 
the analysis of many complex dynamical systems, for ex- 
ample stock-market time series [18], gene-expression sig- 
nals [19,20], neural activity signals [21-23], climat phase 
fluctuations [24], and information traffic time series [25]. 
Compared to these examples, where the time-series are 
measured at each unit of an extended interacting system, 
our approach here is to map the force-distance signal 
measured in repeated experiments on the same molecu- 
lar complex with many degrees of freedom. To demon- 
strate our approach, we analyze a large dataset of force- 
distance curves measured under different conditions in 
RNA-peptidc complex (HIVl-Rev peptide binding at the 
high affinity site in the stem-loop of the RRE on viral 
RNA) [26] and, for comparison, a set of curves for ssDNA- 
ssDNA binding and two control sets. The filtered corre- 
lation matrix mapped onto a binary graph exhibits mod- 
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Fig. 1: Correlation matrix of different types of force curves, 
described in the text. Color intensity from blue (low) to red 
(high) indicates the strength of correlation dj between pairs of 
curves. Shown are correlations after matrix filtering and above 
a threshold CV, > 0.5. 



ular structure, with the subgraphs of nodes (curve index) 
grouped according to their similarity over the entire data 
sets. We then look for the force-curves in each of the sub- 
graphs and find the underlying reasons for their clustering. 



Description of the Experimental Data. We 

created the correlation matrix from all pairs of N = 1188 
force curves, that were pre-processed in the following way: 
contact part was removed, curves were corrected to the 
baseline, and only the part where interactions are expected 
was kept (first 200nm). Curves for the correlation ma- 
trix were selected from four different experiments with 
two different experimental setups. The experiential se- 
tups differ in the number of flexible linkers used to couple 
molecules to the surface and the cantilever: either both 
molecules were coupled to the cantilever/surface via PEG 
spacers or one was coupled to the gold surface directly 
and the other to the cantilever via PEG spacer. In the 
correlation matrix (see Fig. [I] and text below), from left- 
to-right, the first block (I) consists of 200 curves taken 
from measurements on RNA-Rev peptide interaction (two 
linkers, velocity 581nm/s); second block (II) consists of 
188 curves taken from RNA-Rev peptide interaction in 
the presence of neomycin (two linkers, velocity 581nm/s); 
third block (III) contains 200 curves measured on RNA- 
Rev peptide interaction (one linker, velocity 2540nm/s); 
fourth block (IV) contains 200 curves measured on ssDNA- 
ssDNA interaction (two linkers, velocity 218nm/s); fifth 
block (V) has 200 curves of RNA-Rev peptide interaction 
(one linker, velocity 1160nm/s), and the last block (VI) 
consists of 200 curves of Rev peptide- Au interaction (one 
linker, velocity 1160nm/s). 
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Correlation Matrix and its Spectral Analysis. 

— In our dataset consisting of N = 1188 force-distance 
curves, {fi{x}), i = 1,2, 3,- --N, each curve is identified 
by a uniquely defined index i, thus representing a sepa- 
rate pulling event. The elements of the correlation matrix 
Cij are calculated as the Pearson's correlation coefficient 
between each pair of curves as follows: 



Cij = 
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where the distance x is given by a discrete set of mea- 
sured values, and <n, o~j stand for the standard de- 
viation of the force signal fi(x) and fj(x). In Fig. 
Q] we show a 3D color plot of the correlation matrix 
of all force curves, after filtering out spurious correla- 
tions. For the filtering, we used the affinity transfor- 
mation method [22,25], where the element is enhanced 



as Cj . 



if the raws i and j correlate with the 



rest of the matrix in a similar way, yielding M ,, > 1, 
and diminished otherwise. The meta-corrclation factor 
Mij is computed as a Pearson's coefficient of the rear- 



, Ci-ij j Cj+ij i ■ • • i CV/v} and 
,Cjn} without diagonal. 



ranged elements {C^j, On, 
{Cji, Cji, Cj-u, Cj+n, ■ 

For further discussion we notice that the matrix can 
be represented by a network (mathematical graph), where 
each matrix index i = 1, 2, • • • N defines network node and 
the matrix element Cij — a link between nodes i and j. In 
our case the links are symmetrical Cij = Cji by definition 
|T]). Notice that the matrix and network representations 
are formally equivalent. The network picture is suitable 
for visualization and topological interpretation. In partic- 
ular, the matrix in Fig. [1] makes a sparse network contain- 
ing topological modules, i.e., groups of nodes with strong 
connections inside the group and sparse connections be- 
tween them. As the Fig. [1] shows, correlations between 
different sets of data (blocks I through VI) shown as off- 
diagonal block-matrices can be as strong as correlations 
inside the same data set (diagonal blocks). This suggests 
that the network modules may contain curves from differ- 
ent data blocks! In the following we apply the eigenvalue 
spectral methods to identify these topological modules. 

Here we perform spectral analysis of the normalized 
Laplacian operator L related to the filtered correlation C 
in Fig. [1] or more precisely, its binary form with the ele- 
ments: Aij = 1 whenever Cij > Co, or Ay = otherwise. 
The matrix elements of the Laplacian are given by [16,27] 



(2) 



where q^, qj are the number of links at nodes i and j, re- 
spectively. Although the same conclusions can be reached 
using the adjacency matrix Aj directly, the analysis of 
the Laplacian @ is more convenient since its eigenvalue 
spectrum is limited in the range Aj € [0, 2]. Moreover, its 
eigenvectors belonging to few lowest nonzero eigenvalues 
tend to localize along the network modules [15,16]. This 



Fig. 2: Scatterplot of the eigenvectors Vi, V2, V3 of three lowest 
eigenvalues of the Laplacian operator ([2]) related to the full 
correlation matrix in Fig. [T] Four branches are visible in this 
projection, marked as G\,G2,G3,G^. 
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Fig. 3: Overlap plot of all curves belonging to the groups 
Gi, G2, G3, G4 identified from the scatterplot in Fig. 



is a direct consequence of the orthogonality to the eigen- 
vector belonging to zero eigenvalue of the Laplacian (or 
the largest eigenvalue of the adjacency matrix) which has 
all positive components (Perron- Frobenius theorem [28]). 
Precisely, the localization means that, among TV compo- 
nents of the eigenvector, the nonzero (positive/negative) 
components have indexes which coincide with nodes in a 
network module (detailed analysis of spectra in modular 
networks is given in [16]). We use this property of the 
eigenvectors to identify nodes in different modules. 

Identification of modules: Force-curves group- 
ing. — When a modular structure occurs, the localiza- 
tion of eigenvectors belonging to smallest non-zero eigen- 
values is manifested in a branched pattern of the scatter- 
plot in the space of these eigenvectors [16]. In Fig. [2] 
we show the scatterplot of the eigenvectors (Vi,V2,Vs) 
belonging for three smallest non-zero eigenvalues of the 
Laplacian j2]), related to the correlation matrix in Fig. 
[T] In the scatterplot, each point carries one index, thus 
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Table 1: (upper part)Representative curves appearing at tips 
of four branches G\, G2, G3, G4 in the scatterplot in Fig. [2] and 
their distribution over original blocks od data I through VI. 
(Middle and bottom) Further splitting of groups G2 and G3. 
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II 


III 


IV 
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VI 


Gl 
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16 








G2 


51 


28 


177 





181 


148 


G3 


60 
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135 
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G4 
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3 
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22 
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20 


10 


5 





3 


21 


G3-gl 
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6 





24 
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G3-g2 


24 


9 





24 





5 


G3-g3 


18 


15 





28 
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Fig. 4: Example of the correlation network constructed from 
the force curves in group G2 from Fig. [2] exhibiting modular 
structure. Shown are only links above the threshold Co = 0.91. 

indicating one network node, that is a force curve index 
in our original dataset. The plot in Fig. [2] shows four 
branches, marked by Gi, G2, G3, G4, thus four groups of 
curves can be identified. The most representative curves 
in each group are those at the tips of branches with most 
distinct curves sitting at the opposite ends of the branches 
G2 and G3, whereas, the differences gradually diminishes 
closer to the center of the plot. By matching the indexes 
with curves in the original dataset, we identify representa- 
tive curves at the tips of four branches (Table 1.). Overlay 
of all curves in the G\ ■ ■ ■ G4 groups is shown in Fig. [3J 

Each of the groups contains number of curves from one 
or several blocks of the original datasets. Recalling the na- 
ture of the data in different blocks, we see that the module 
G2 contains curves obtained predominantly on experimen- 
tal setup with one PEG spacer (blocks III, V, and VI) 
and a fraction of data with two spacers (from blocks I and 
II), while excluding altogether the DNA-DNA interactions 
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Fig. 5: For the branch G2 of Fig. [2] (top) the eigenvalue spec- 
trum and (bottom) the scatterplot of three eigenvectors be- 
longing to lowest nonzero eigenvalues. Three subgroups are 
identified marked by gl,g2,g3, see text for details. 

(block IV). On the other hand, the module G3 contains 
data from DNA-DNA interactions and a number of curves 
from Rcv-RNA and blocked Rev-RNA with neomycin, all 
data obtained using two spacers. G\ has similar composi- 
tion although it appears as a separate module, similar as 
the small group G4 (see visual differences in Fig. [3]). 

We further analyze the group of curves in G2 applying 
the same approach on the now reduced correlation matrix. 
The complete group consists of N2 = 597 curves. The cor- 
relation network of these curves, shown in Fig. 2J exhibits 
modular structure, suggesting that smaller subgroups of 
the group G2 can be identified. The eigenvalues of the 
Laplacian related to this correlation matrix are shown in 
the ranking order in Fig. [5l top panel. Compatible with 
the network modularity in Fig.[4]is the appearance of three 
eigenvalues in the gap between A = and the main part of 
the spectrum. The corresponding scatterplot in the space 
of three eigenvectors belonging to these eigenvalues is also 
shown in Fig. [5j bottom panel. In this case similarity be- 
tween points, forming a half- helmet in (Vi, V2, V3) space, is 
stronger compared to Fig. [2 however, groups can be iden- 
tified by the end points of the half-ring in the horizontal 
plane and points with largest vertical distance, marked as 
<?i > 92 j 53 and different symbols (colors) in Fig. [5] 

The identity of these curves with respect to the origi- 
nal datasets is also indicated in Table 1, middle part. As 
mentioned above, curves in large group G2 are on Rev 
peptide binding, whereas, their subgroups appear to be 
from different pulling velocities: G2 — gl consists mostly 
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Fig. 6: Overlay of the selected force curves of low- and high- 
velocity associated to ends of the scatterplot ring in Fig. [5] 

of curves in blocks V and VI, measurements at velocity 
of 1160 nm/sec, while the groups Gi — <?2 and G2 — <?3 
at two ends of the semi-ring contains curves measured at 
highest (2540 nm/sec) and lowest (581 nm/sec) velocity, 
respectively. In the experiment, measurements at differ- 
ent velocities and correct assignments of the curves are 
important for the extrapolation of bonding parameters 
to zero force values. Pulling velocity affects both force 
loading rate F(F) as well as the average disruption force 
(F). Within our methods selected groups of high- and 
low-velocity curves, G2 — g2 and G2 — g3, respectively, are 
shown in Fig. [6] Similar analysis of the data contained 
in the group G3 (not shown) leads to the curves identi- 
fied in the lower part of the Table 1. Here data contain 
three different interactions (DNA-DNA, Rev-RNA, and 
Rev- RN A- with- neomycin) , all measured in the setup with 
two spacers and low velocity. We find much larger mixing 
between the groups, indicating either increased number of 
aspecific binding or a dominant role of spacers. 

Finally, in Fig. [7] we show histograms of the rupture 
forces made upon curves in groups G2 and G3 selected by 
our methodology. Detailed analysis of Rev-RNA bonding 
parameters is given in [26] . 

Conclusions. — We have shown that stochastic sig- 
nals of different molecular systems (peptide-RNA, DNA- 
DNA) measured in the dynamic force spectroscopy exper- 
iments can be effectively selected according to their simi- 
larities. Our methodology uses the signal's relevant corre- 
lation matrix mapped onto a mathematical graph. Then 
using the spectral analysis of these graphs the groups of 
similar curves are detected, which appear as topological 
modules on them. Although the method is essentially sta- 
tistical, we have shown that strong regularities in group- 
ings of the force curves occur (and can be effectively used 
for the signal evaluation), based on the pulling speeds and 
experimental setup and even type of the interaction mea- 
sured. Note that for the demonstration purposes in this 
work we used raw data with minimal pre-processing. Fur- 
ther pre-processing, e.g., with cuve-fitting [12], could be 
incorporated and increase selectivity in our method. Im- 
proved efficiency in the evaluation of force curves by this 
approach is also expected in the case of larger molecules or 
molecular complexes and the situations where distinction 
between many binding sites is applicable. 




Fig. 7: Histograms of rupture forces from selected curves in 
blocks VI, V, III from group G2 (left) and from blocks I, II, IV 
from group G3 (right). ( Cf. Table 1. and data description.) 
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